Zuse Institute Berlin
Crash course on RL
What is importance sampling
Optimal biasing as an RL problem
The things I’d like to connect
A miniopoly board
Important
We train our robot to maximize the cumulative reward as it takes actions exploring the state space
We calculated the transition probabilities using our knowledge of the dice
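As a concrete sketch of that dice computation (the board size and all names here are my own toy choices, not from the talk), the transition matrix of a small circular board under two fair dice can be built exactly:

```python
# Toy sketch: exact transition probabilities on a small circular board
# when movement is the sum of two fair six-sided dice.
from fractions import Fraction

N_SQUARES = 8  # assumed board size for this toy example

# P(sum of two dice = k) = (6 - |k - 7|) / 36 for k = 2..12
dice_probs = {k: Fraction(6 - abs(k - 7), 36) for k in range(2, 13)}

def transition_row(square):
    """Probability of landing on each square, starting from `square`."""
    row = [Fraction(0)] * N_SQUARES
    for steps, p in dice_probs.items():
        row[(square + steps) % N_SQUARES] += p
    return row

P = [transition_row(s) for s in range(N_SQUARES)]
assert all(sum(row) == 1 for row in P)  # each row is a distribution
```

Using exact fractions makes the check that every row sums to one airtight; in a larger example one would switch to floats or NumPy.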
What we’ve covered so far
Important
The general idea of importance sampling is to draw samples from a different probability measure and then reweight them, so that the resulting estimator of the quantity of interest is still unbiased
Our main goal is to compute \[ \Psi(x) \coloneqq \mathbb{E}^x [I(X)] \coloneqq \mathbb{E}[I(X) \mid X_0 = x] \]
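A minimal importance-sampling sketch (a toy Gaussian example of my own, not the talk's setting): to estimate the rare probability \(P(Z > 4)\) for \(Z \sim \mathcal{N}(0,1)\), we sample from the shifted proposal \(\mathcal{N}(4,1)\) and reweight by the likelihood ratio, which keeps the estimator unbiased:

```python
# Toy importance sampling: estimate P(Z > 4), Z ~ N(0, 1), by sampling
# from the shifted proposal N(MU, 1) and reweighting.
import math
import random

random.seed(0)
MU = 4.0          # proposal shift, chosen to push samples toward the event
N = 100_000

total = 0.0
for _ in range(N):
    y = random.gauss(MU, 1.0)               # sample from the proposal
    weight = math.exp(-MU * y + MU**2 / 2)  # dN(0,1)/dN(MU,1) at y
    total += (y > 4.0) * weight

estimate = total / N
exact = 0.5 * math.erfc(4.0 / math.sqrt(2))  # P(Z > 4) ~ 3.17e-5
```

A naive Monte Carlo estimate would see roughly 3 hits per 100,000 samples; the shifted proposal sees the event about half the time and recovers the tiny probability accurately.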
But…
Tip
The new, controlled dynamics are now described as \[\begin{equation} \label{eq: controlled langevin sde} \mathrm dX_s^u = (-\nabla V(X_s^u) + \sigma(X_s^u) \, u(X_s^u))\mathrm ds + \sigma(X_s^u) \mathrm dW_s, \qquad X_0^u = x \end{equation}\]
Via Girsanov, we can relate our QoI to the original as such: \[\begin{equation} \label{eq: expectation IS} \mathbb{E}^x\left[I(X)\right] = \mathbb{E}^x\left[I(X^u) M^u\right], \end{equation}\]
where the exponential martingale \[\begin{equation} \label{eq: girsanov martingale} M^u \coloneqq \exp{\left(- \int_0^{\tau^u} u(X_s^u) \cdot \mathrm dW_s - \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right)} \end{equation}\] corrects for the bias the pushing introduces.
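A sketch of this reweighting in practice (the 1-d potential, constant control, and fixed horizon are my own assumptions for simplicity): simulate the controlled SDE with Euler–Maruyama while accumulating the Girsanov weight \(M^u\); since \(M^u\) is a mean-one martingale, its sample average stays near one for any control:

```python
# Euler-Maruyama simulation of the controlled SDE
#   dX = (-grad_V(X) + sigma * u(X)) ds + sigma dW
# while accumulating the Girsanov weight M^u along the path.
import math
import random

random.seed(2)
DT = 0.01
N_STEPS = 200       # fixed horizon T = 2 (assumed, instead of a stopping time)
SIGMA = 1.0

def grad_V(x):
    return x        # toy potential V(x) = x^2 / 2

def u(x):
    return 0.5      # an arbitrary constant push

def weighted_path(x0=0.0):
    x, log_m = x0, 0.0
    for _ in range(N_STEPS):
        dw = random.gauss(0.0, math.sqrt(DT))
        # accumulate log M^u = -int u dW - 1/2 int |u|^2 ds
        log_m += -u(x) * dw - 0.5 * u(x) ** 2 * DT
        x += (-grad_V(x) + SIGMA * u(x)) * DT + SIGMA * dw
    return x, math.exp(log_m)

samples = [weighted_path() for _ in range(20_000)]
mean_weight = sum(m for _, m in samples) / len(samples)  # should be near 1
```

Note that the same Brownian increment `dw` drives both the path and the weight; using independent noise here is a classic bug that silently breaks the unbiasedness.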
Important
The previous relationship always holds. But the variance of the estimator depends heavily on the choice of \(u\).
Clearly, we aim to achieve the smallest possible variance through an optimal control \(u^*\): \[\begin{equation} \label{eq: variance minimization} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = \inf_{u \in \mathcal{U}} \left\{ \operatorname{Var} (I(X^u) M^u) \right\} \end{equation}\]
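Staying with a toy Gaussian example (my own, for illustration): the reweighted estimator of \(P(Z > 4)\) is unbiased for any shift of the proposal, but its variance changes dramatically with the choice of bias:

```python
# Compare the variance of the IS estimator for P(Z > 4), Z ~ N(0, 1),
# under two different proposal shifts: the estimator is unbiased for
# both, but the variance depends strongly on the bias.
import math
import random

def is_samples(mu, n, seed=1):
    """Weighted samples for P(Z > 4) using the proposal N(mu, 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        y = rng.gauss(mu, 1.0)
        out.append((y > 4.0) * math.exp(-mu * y + mu * mu / 2))
    return out

def sample_var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var_bad = sample_var(is_samples(mu=1.0, n=50_000))   # weak push
var_good = sample_var(is_samples(mu=4.0, n=50_000))  # push toward the event
```

Shifting all the way to the event boundary (`mu=4.0`) reduces the variance by roughly two orders of magnitude compared to the timid shift (`mu=1.0`).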
It turns out that the problem of minimizing the variance corresponds to a problem in stochastic optimal control.
The cost functional \(J\) that yields the variance-minimizing control is \[\begin{equation} \label{eq: cost functional} J(u; x) \coloneqq \mathbb{E}^x\left[\mathcal{W}(X^u) + \frac{1}{2} \int_0^{\tau^u} |u(X_s^u)|^2 \mathrm ds \right], \end{equation}\]
With this formulation, writing \(I(X) = e^{-\mathcal{W}(X)}\), the value function \(\Phi(x) \coloneqq -\log \Psi(x)\) satisfies \[\begin{equation} \Phi(x) = \inf_{u \in \mathcal{U}} J(u; x). \end{equation}\]
Important
The optimal bias achieves zero variance: \[\begin{equation} \operatorname{Var} \left( I(X^{u^*}) M^{u^*} \right) = 0. \end{equation}\]
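This zero-variance property can be made plausible in one line, in the standard setting where \(I(X) = e^{-\mathcal{W}(X)}\) (my reading of the setup here): along optimally controlled trajectories the weighted observable is almost surely constant, \[\begin{equation} I(X^{u^*}) \, M^{u^*} = e^{-\Phi(x)} \quad \text{almost surely}, \end{equation}\] so a single realization already returns the exact value \(\Psi(x) = e^{-\Phi(x)}\).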
The time-discretized objective function is given by \[\begin{equation} \small J(u; x) \coloneqq \mathbb{E}^{x} \left[ g(s_{T_u}) + \sum_{t=0}^{T_u - 1} f(s_t) \Delta t + \frac{1}{2} \sum_{t=0}^{T_u - 1} |u(s_t)|^2 \Delta t \right] \end{equation}\]
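The discretized objective can be estimated by plain Monte Carlo rollouts; here is a minimal sketch (the 1-d dynamics, the costs `f` and `g`, and the stopping rule are all my own assumptions for illustration):

```python
# Monte Carlo estimate of the time-discretized cost
#   J(u; x) ~ E[ g(s_T) + sum f(s_t) dt + 1/2 sum |u(s_t)|^2 dt ]
# for a fixed control u, in a 1-d toy setting.
import math
import random

random.seed(3)
DT = 0.01
SIGMA = 1.0

def grad_V(x):
    return x                      # toy potential V(x) = x^2 / 2

def u(x):
    return 0.5                    # some fixed control to be evaluated

def g(x):
    return 0.0                    # terminal cost (assumed zero)

def f(x):
    return 1.0                    # running cost 1 => expected hitting time

def cost_of_rollout(x0=0.0, target=1.0, max_steps=100_000):
    x, cost = x0, 0.0
    for _ in range(max_steps):
        if x >= target:           # discrete stopping time T_u
            return cost + g(x)
        cost += (f(x) + 0.5 * u(x) ** 2) * DT
        x += (-grad_V(x) + SIGMA * u(x)) * DT \
             + SIGMA * random.gauss(0.0, math.sqrt(DT))
    return cost + g(x)            # safety cap if the target is never hit

J_est = sum(cost_of_rollout() for _ in range(2_000)) / 2_000
```

This is exactly the quantity a policy-gradient or cross-entropy scheme would then minimize over a parametrized family of controls.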
The connection (Quer and Borrell 2024) works because of the properties of \(J\)
Two possibilities